In [ ]:
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Overview

Topic

This notebook demonstrates using the AutoML API for Vision Classification.

Audience

The audience for this notebook is software engineers (SWE) with limited experience in machine learning (ML).

Prerequisites

You should be familiar with:

- Python 3.X
- Google Cloud Platform (GCP) and using GCP buckets.
- The concept of image classification.

Dataset

This notebook uses the Kaggle dog breed identification dataset, located at:

https://www.kaggle.com/c/dog-breed-identification/data

From the Kaggle web page, select Download All. Once the archive has downloaded to your laptop, unzip it. It contains two additional zip files: train.zip and test.zip. For our purposes, it is sufficient to unzip just train.zip, which contains the training data.

This dataset contains train and test images for training an image classifier to classify 120 dog breeds. The train dataset contains approximately 10,000 images.

These images are not partitioned into folders based on class. Instead, all the images are in one folder and each has a unique Id. The CSV file labels.csv maps each image Id to its label (dog breed). Note that the Id field in the CSV file is not a file path, but just the Id. Later, we will make a modified copy of this CSV and replace each Id with the full Cloud Storage path to the image file.
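
For illustration, the sketch below shows that rewrite in pure Python; the "Create the CSV file" step later in this notebook performs the same transformation with a shell one-liner. The bucket name here is a placeholder.


In [ ]:
# Illustrative sketch only: rewrite the Kaggle labels.csv so that each image Id
# becomes a full Cloud Storage path, producing rows of the form
#   gs://<bucket>/dog_breeds/<id>.jpg,<breed>
import csv

BUCKET_NAME = "[your-bucket-name]"  # placeholder; set properly later in this notebook

with open("labels.csv") as src, open("new_labels.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    next(reader)  # skip the "id,breed" header row
    for image_id, breed in reader:
        writer.writerow(
            ["gs://{}/dog_breeds/{}.jpg".format(BUCKET_NAME, image_id), breed])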

Objective

The objective of this tutorial is to learn how to use the AutoML API to train a vision classification model, deploy the model, and make predictions using a gRPC or REST API interface.

Costs

This tutorial uses billable components of AutoML Vision.

Learn about AutoML Vision Pricing

Set up your local development environment

If you are using Colab or AI Platform Notebooks, your environment already meets all the requirements to run this notebook. You can skip this step.

Otherwise, make sure your environment meets this notebook's requirements. You need the following:

  • The Google Cloud SDK
  • The Google AutoML SDK
  • Git
  • Python 3
  • virtualenv
  • Jupyter notebook running in a virtual environment with Python 3

The Google Cloud guide to Setting up a Python development environment and the Jupyter installation guide provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions; the optional cell after these steps performs a quick sanity check of the result.

  1. Install and initialize the Cloud SDK.

  2. Install Python 3.

  3. Install AutoML SDK using the pip install google-cloud-automl command in a shell.

  4. Install virtualenv and create a virtual environment that uses Python 3.

  5. Activate that environment and run pip install jupyter in a shell to install Jupyter.

  6. Run jupyter notebook in a shell to launch Jupyter.

  7. Open this notebook in the Jupyter Notebook Dashboard.
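
If you are working in a local (non-Colab) environment, the following optional cell is a quick sanity check, a minimal sketch rather than an exhaustive verification, that the main requirements above are in place.


In [ ]:
# Optional sanity check for a local environment (a sketch, not exhaustive).
import sys

# Confirm the Python version.
print("Python version:", sys.version.split()[0])

# Confirm the AutoML client library is importable.
try:
    import google.cloud.automl_v1beta1  # noqa: F401
    print("google-cloud-automl is installed")
except ImportError:
    print("google-cloud-automl is missing; run: pip install google-cloud-automl")

# Confirm the Cloud SDK is on the PATH.
!gcloud --version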

Set up your GCP project

The following steps are required, regardless of your notebook environment.

  1. Select or create a GCP project.

  2. Make sure that billing is enabled for your project.

  3. Enable the AI Platform APIs and Compute Engine APIs.

  4. Enter your project ID in the cell below. Then run the cell to make sure the Cloud SDK uses the right project for all the commands in this notebook.

Note: Jupyter runs lines prefixed with ! as shell commands, and it interpolates Python variables prefixed with $ into these commands.

Jupyter runs lines prefixed with % as automagic commands, which are interpreted within your IPython session. Automagic commands include %ls, %pwd, %env and %pip for example.


In [ ]:
PROJECT_ID = "[your-project-id]" #@param {type:"string"}
!gcloud config set project $PROJECT_ID

Authenticate your GCP account

If you are using AI Platform Notebooks, your environment is already authenticated. Skip this step.

If you are using Colab, run the cell below and follow the instructions when prompted to authenticate your account via OAuth.

Otherwise, follow these steps:

  1. In the GCP Console, go to the Create service account key page.

  2. From the Service account drop-down list, select New service account.

  3. In the Service account name field, enter a name.

  4. From the Role drop-down list, select Machine Learning Engine > AI Platform Admin and Storage > Storage Object Admin.

  5. Click Create. A JSON file that contains your key downloads to your local environment.

  6. Enter the path to your service account key as the GOOGLE_APPLICATION_CREDENTIALS variable in the cell below and run the cell.


In [ ]:
import sys

# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

if 'google.colab' in sys.modules:
  from google.colab import auth as google_auth
  google_auth.authenticate_user()

# If you are running this notebook locally, replace the string below with the
# path to your service account key and run this cell to authenticate your GCP
# account.
else:
  %env GOOGLE_APPLICATION_CREDENTIALS your_path_to_credentials.json

Set Service Account Role for AutoML

Grant your user account the AutoML Admin role and your new service account the AutoML Editor role.

- Replace your-userid@your-domain with your user account.
- Replace service-account-name with the name of your new service account, for example service-account1@myproject.iam.gserviceaccount.com.

In [ ]:
!gcloud auth login
!gcloud projects add-iam-policy-binding $PROJECT_ID \
   --member="user:[your-userid@your-domain]" \
   --role="roles/automl.admin"
!gcloud projects add-iam-policy-binding $PROJECT_ID \
   --member="serviceAccount:[service-account-name]" \
   --role="roles/automl.editor"

Allow the AutoML service to access your Google Cloud project

Allow the AutoML service account (custom-vision@appspot.gserviceaccount.com) to access your Google Cloud project resources:

This is a pre-existing global AutoML Vision service account which is separate from the project service account you just created; it is not visible in your project's list of service accounts.


In [ ]:
!gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:custom-vision@appspot.gserviceaccount.com" \
  --role="roles/ml.admin"
!gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:custom-vision@appspot.gserviceaccount.com" \
  --role="roles/storage.admin"
!gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:custom-vision@appspot.gserviceaccount.com" \
  --role="roles/serviceusage.serviceUsageAdmin"

Create a Cloud Storage bucket

The following steps are required, regardless of your notebook environment.

When you submit a training job using the AutoML Vision SDK, you must store the training data in a Cloud Storage bucket.

Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.

You may also change the COMPUTE_REGION variable, which is used for operations throughout the rest of this notebook. Make sure to choose a region where Cloud AI Platform services are available. You may not use a Multi-Regional Storage bucket for training with AI Platform.


In [ ]:
BUCKET_NAME = PROJECT_ID + "-vcm"        #@param {type:"string"}

Only if your bucket doesn't already exist: Run the following cell to create your Cloud Storage bucket.


In [ ]:
# Default compute region for AutoML
COMPUTE_REGION='us-central1'

! gsutil mb -l $COMPUTE_REGION gs://$BUCKET_NAME

Finally, validate access to your Cloud Storage bucket by examining its contents:


In [ ]:
! gsutil ls -al gs://$BUCKET_NAME

Install packages and dependencies with pip

Install additional dependencies not installed in the notebook environment (e.g. XGBoost, adanet, tf-hub).

  • Use the latest major GA version of the framework.

In [ ]:
%pip install -U google-cloud-storage

Tutorial

Import libraries and define constants


In [ ]:
import tensorflow as tf

import numpy as np

# import the Google AutoML client library
from google.cloud import automl_v1beta1 as automl

Copy the sample images into your bucket

Copy the dog-breeds dataset into your own bucket.

This may take a couple of minutes to complete.


In [ ]:
# Set this variable to the location of the training data on your laptop.
DATASET="[my_path/dog-breed-identification]"

# Recursively copy the unzipped training images into the bucket.
!gsutil -m cp -R $DATASET/train gs://$BUCKET_NAME/dog_breeds

Create the CSV file

The sample dataset contains a CSV file (labels.csv) that maps each image Id to its label. You'll use that to create your own CSV file:

1. Update the CSV file to point to the files in your own bucket.
2. Copy the CSV file into your bucket.

To learn more about preparing your training data, see Preparing Training Data


In [ ]:
# Rewrite labels.csv: skip the header row, prepend the bucket path to each image
# Id, and append the .jpg extension, producing rows of the form
#   gs://<bucket>/dog_breeds/<id>.jpg,<breed>
!tail -n +2 $DATASET/labels.csv | sed "s:^:gs\://$BUCKET_NAME/dog_breeds/:" | sed "s:,:\.jpg,:" > $DATASET/new_labels.csv
!gsutil cp $DATASET/new_labels.csv gs://$BUCKET_NAME/csv/
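
Optionally, confirm that the rewritten CSV has the expected format by printing its first few rows from the bucket.


In [ ]:
# Optional: inspect the first few rows of the import CSV in the bucket.
! gsutil cat gs://$BUCKET_NAME/csv/new_labels.csv | head -3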

Create and Configure an AutoML client instance


In [ ]:
# Create an AutoML client
client = automl.AutoMlClient()

# Derive the full GCP path to the project
project_location = client.location_path(PROJECT_ID, COMPUTE_REGION)

Creating a dataset

A dataset contains representative samples of the type of content you want to classify, labeled with the category labels you want your custom model to use. The dataset serves as the input for training a model.

The main steps for building a dataset are:

- Specify a name for the dataset.
- Create a dataset and specify whether to allow multiple labels on each item.
- Import data items into the dataset.

The first step in creating a custom model is to create an empty dataset that will eventually hold the training data for the model. When you create a dataset, you specify the type of classification you want your custom model to perform:

- MULTICLASS assigns a single label to each classified image.
- MULTILABEL allows an image to be assigned multiple labels.

In this tutorial, we create a MULTICLASS dataset.


In [ ]:
# Specify a name for the dataset
DATASET_NAME="[my-dataset-name]"

# Specify the image classification type for the dataset.
dataset_metadata = {"classification_type": 'MULTICLASS'}
# Set dataset name and metadata of the dataset.
my_dataset = {
    "display_name": DATASET_NAME,
    "image_classification_dataset_metadata": dataset_metadata,
}

# Create a dataset with the dataset metadata in the region.
response = client.create_dataset(project_location, my_dataset)

Display the response from creating the empty dataset.


In [ ]:
# Display the dataset information.
print("Dataset name: {}".format(response.name))
print("Dataset id: {}".format(response.name.split("/")[-1]))
print("Dataset display name: {}".format(response.display_name))
print("Image classification dataset metadata:")
print("\t{}".format(response.image_classification_dataset_metadata))
print("Dataset example count: {}".format(response.example_count))

# Save the dataset ID
dataset_id = response.name.split("/")[-1]

Importing items into a dataset

After you have created a dataset, you can import item URIs and labels for items from a CSV file stored in a Google Cloud Storage bucket.


In [ ]:
# Get the full path of the dataset.
dataset_full_id = client.dataset_path(
    PROJECT_ID, COMPUTE_REGION, dataset_id
)

# Specify the location of the CSV file for the dataset
CSV_DATASET = "gs://" + BUCKET_NAME + "/csv/new_labels.csv"
input_config = {"gcs_source": {"input_uris": [CSV_DATASET]}}

# Import data from the input URI.
response = client.import_data(dataset_full_id, input_config)

Wait for the images to be imported into the dataset. The call below blocks until the import has completed; this may take up to 20 minutes.


In [ ]:
# synchronous check of operation status.
print("Data imported. {}".format(response.result()))

Listing Datasets

A project can have multiple datasets, each used to train a separate model. You can get a list of the available datasets and can delete datasets you no longer need.


In [ ]:
response = client.list_datasets(project_location, None)

print("List of datasets:")
for dataset in response:
    # Display the dataset information.
    print("Dataset name: {}".format(dataset.name))
    print("Dataset id: {}".format(dataset.name.split("/")[-1]))
    print("Dataset display name: {}".format(dataset.display_name))
    print("Image classification dataset metadata:")
    print("\t{}".format(dataset.image_classification_dataset_metadata))
    print("Dataset example count: {}\n".format(dataset.example_count))

Training a Cloud-Hosted Model

You create a custom model by training it using a prepared dataset. AutoML API uses the items from the dataset to train the model, test it, and evaluate its performance. You review the results, adjust the training dataset as needed, and train a new model using the improved dataset.

Training a model can take several hours to complete. The AutoML API enables you to check the status of training.


In [ ]:
# Specify a name for your model.
MODEL_NAME="[your-model-name]"

# Set training for a maximum of 1 hour
train_budget=1

# Set model name and model metadata for the image dataset.
my_model = {
    "display_name": MODEL_NAME,
    "dataset_id": dataset_id,
    "image_classification_model_metadata": {"train_budget": train_budget}
}

# Create a model with the model metadata in the region.
response = client.create_model(project_location, my_model)

Display response from initiating the training of the model.


In [ ]:
print("Training operation name: {}".format(response.operation.name))

Wait for training to complete. The call below blocks until training has completed; this may take up to 1 hour.


In [ ]:
# synchronous check of operation status.
print("Training done. {}".format(response.result()))

# Save the model ID
model_id = response.result().name.split("/")[-1]

Getting information about a model

Once training of your custom model is complete, you can get information about the newly created model.


In [ ]:
from google.cloud.automl_v1beta1 import enums

# Get the full path of the model.
model_full_id = client.model_path(PROJECT_ID, COMPUTE_REGION, model_id)

# Get complete detail of the model.
model = client.get_model(model_full_id)

# Retrieve deployment state.
if model.deployment_state == enums.Model.DeploymentState.DEPLOYED:
    deployment_state = "deployed"
else:
    deployment_state = "undeployed"

# Display the model information.
print("Model name: {}".format(model.name))
print("Model id: {}".format(model.name.split("/")[-1]))
print("Model display name: {}".format(model.display_name))
print("Image classification model metadata:")
print(
    "Training budget: {}".format(
        model.image_classification_model_metadata.train_budget
    )
)
print(
    "Training cost: {}".format(
        model.image_classification_model_metadata.train_cost
    )
)
print(
    "Stop reason: {}".format(
        model.image_classification_model_metadata.stop_reason
    )
)
print(
    "Base model id: {}".format(
        model.image_classification_model_metadata.base_model_id
    )
)
print("Model deployment state: {}".format(deployment_state))

Evaluating the Model

After training a model, AutoML Vision uses items from the TEST set to evaluate the quality and accuracy of the new model. For more information on how to interpret the evaluation, see Evaluating models.


In [ ]:
# Get the full path of the model.
model_full_id = client.model_path(PROJECT_ID, COMPUTE_REGION, model_id)

# List all the model evaluations in the model by applying filter.
response = client.list_model_evaluations(model_full_id, None)

print("List of model evaluations:")
for element in response:
    print(element)
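
Each item in the response is a ModelEvaluation message. As a minimal sketch, assuming the v1beta1 field names (annotation_spec_id, evaluated_example_count, and classification_evaluation_metrics.au_prc), the cell below pulls a single summary metric, the area under the precision-recall curve, out of the model-level evaluation.


In [ ]:
# Print a summary metric from the model-level evaluation.
# Note: field names here follow the v1beta1 API surface and are assumptions;
# adjust them if your client library version differs.
for evaluation in client.list_model_evaluations(model_full_id, None):
    # The entry without an annotation_spec_id is the overall (model-level)
    # evaluation; the remaining entries are per-label evaluations.
    if not evaluation.annotation_spec_id:
        metrics = evaluation.classification_evaluation_metrics
        print("Evaluated example count: {}".format(evaluation.evaluated_example_count))
        print("Area under precision-recall curve: {}".format(metrics.au_prc))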